Round 1 & Round 2 - Technical
🔹 Can you briefly introduce yourself and walk us through your journey as a Data Engineer so far?
🔹 Tell us about your current project: technologies used, data architecture, and your daily responsibilities.
🔹 What’s the average data volume you handle in your current project? How do you ensure the efficient processing of this data?
🔹 Can you explain the challenges you’ve faced in your projects and how you overcame them?
🔹 What is Azure Data Factory (ADF)? Would you classify it as an ETL or ELT tool? Why?
🔹 What is a linked service in ADF? How do you create one?
🔹 Can you explain the difference between a linked service and a dataset in ADF?
🔹 What are Integration Runtimes (IR) in ADF? Can you explain the different types and their use cases?
🔹 What are triggers in ADF, and how have you used them in your project? Can you explain tumbling window triggers?
🔹 How do you manage moving ADF pipelines from development to production? Have you used ARM templates for deployment?
🔹 Write a SQL query to find the second-highest salary in a table.
🔹 Given a Python list of numbers, write a function to find and return the top 3 largest numbers.
🔹 Can you explain how you handle data validation using SQL or Python in your current project?
🔹 How do you manage schema changes in PySpark when processing data over time?
🔹 Why is RDD considered resilient and fault-tolerant? Can you explain how Spark ensures data reliability?
🔹 What is lazy evaluation in Spark? How does it affect performance in PySpark applications?
🔹 Can you differentiate between persist() and cache() in Spark? When would you use each?
🔹 What is a mount point in Azure Databricks? How do you mount ADLS Gen2 to Databricks?
🔹 Can you explain the difference between reduceByKey() and groupByKey() in Spark?
🔹 What are the key differences between DataFrames and RDDs in PySpark?
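
For the two coding questions in this round (second-highest salary and top 3 largest numbers), here is a minimal sketch of one possible answer, not the answer the interviewers were looking for. The `employees` table, its `name` and `salary` columns, and the function names are assumptions made up for illustration; SQLite is used only so the query can run without a database server.

```python
import heapq
import sqlite3


def top_three(numbers):
    """Return up to the three largest values, largest first (duplicates kept)."""
    return heapq.nlargest(3, numbers)


def second_highest_salary(rows):
    """Return the second-highest distinct salary, or None if there is none.

    `rows` is a list of (name, salary) tuples loaded into a hypothetical
    `employees` table in an in-memory SQLite database.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)
    # The inner query finds the top salary; the outer query takes the
    # largest salary strictly below it, so ties at the top are skipped.
    query = """
        SELECT MAX(salary)
        FROM employees
        WHERE salary < (SELECT MAX(salary) FROM employees)
    """
    result = conn.execute(query).fetchone()[0]
    conn.close()
    return result
```

The subquery form treats tied top salaries as one value; an interviewer may instead expect a DENSE_RANK() based answer, so it is worth asking how ties should be handled.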
Round 3 - Managerial
🔹 Can you introduce yourself and provide a brief overview of your professional background as a Data Engineer?
🔹 How large is your team, and what are the roles of other team members in your project?
🔹 Describe the architecture of your current project. What was your role in designing or implementing it?
🔹 What is the difference between a Star Schema and a Snowflake Schema? Which one have you implemented in your data warehouse projects, and why?
🔹 How do ROW_NUMBER(), RANK(), and DENSE_RANK() differ, and when would you use each in a project scenario?
🔹 How do you ensure smooth communication and collaboration between different stakeholders, like data scientists, business teams, and developers?
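
For the ROW_NUMBER() / RANK() / DENSE_RANK() question above, a small sketch can make the difference visible on tied values. The `employees` table, its columns, and the `rank_salaries` helper are hypothetical, and the example assumes the SQLite bundled with Python is version 3.25 or newer (the first release with window functions).

```python
import sqlite3


def rank_salaries(rows):
    """Return (name, salary, row_number, rank, dense_rank) per employee.

    `rows` is a list of (name, salary) tuples for a hypothetical
    `employees` table held in an in-memory SQLite database.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)
    query = """
        SELECT name,
               salary,
               ROW_NUMBER() OVER (ORDER BY salary DESC, name) AS row_num,
               RANK()       OVER (ORDER BY salary DESC)       AS rnk,
               DENSE_RANK() OVER (ORDER BY salary DESC)       AS dense_rnk
        FROM employees
        ORDER BY salary DESC, name
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample = [("Asha", 300), ("Ben", 300), ("Chen", 200), ("Dee", 100)]
for row in rank_salaries(sample):
    print(row)
```

With two employees tied at the top, ROW_NUMBER() still numbers them 1 and 2, RANK() gives both 1 and then jumps to 3, and DENSE_RANK() gives both 1 and continues with 2.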
Round 4 - HR Discussion
🔹 Salary negotiations
🔹 Notice period